{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python38264bitpandassconda285c25d0d8784f5bba9542830bc5427b",
   "display_name": "Python 3.8.2 64-bit ('pandass': conda)"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
    {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table align=\"left\">\n",
    "  <td>\n",
    "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/001/exercise/starter-notebook.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
    "  </td>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A starter notebook with tasks listed to help you get started on your first machine learning project"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### First phase: Inspecting the data\n",
    "(this process is where you familiarize yourself with the data)\n",
    "\n",
    "Tasks:\n",
    "- Inspect the data\n",
    "- Check for null/missing values\n",
    "- View statistical details using ``describe()``\n",
    "\n",
    "Step one has been done for you \n",
    "After finishing the tasks think of the following:\n",
    "\n",
    "What can you infer from the statistical measures? like possible outliers? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Important note:**\n",
    "\n",
    "The data is already split into train.csv and test.csv, you generally would build your model using the train.csv **without** using any test.csv data as that would lead to overfitting your model\n",
    "\n",
    "#### Reminder: you do not have to stick to the tasks listed word for word, the tasks are listed as a guide to guide you.\n",
    "#### Feel free to explore more and do more on your own!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports \n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "   PassengerId  Survived  Pclass  \\\n0            1         0       3   \n1            2         1       1   \n2            3         1       3   \n3            4         1       1   \n4            5         0       3   \n\n                                                Name     Sex   Age  SibSp  \\\n0                            Braund, Mr. Owen Harris    male  22.0      1   \n1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   \n2                             Heikkinen, Miss. Laina  female  26.0      0   \n3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   \n4                           Allen, Mr. William Henry    male  35.0      0   \n\n   Parch            Ticket     Fare Cabin Embarked  \n0      0         A/5 21171   7.2500   NaN        S  \n1      0          PC 17599  71.2833   C85        C  \n2      0  STON/O2. 3101282   7.9250   NaN        S  \n3      0            113803  53.1000  C123        S  \n4      0            373450   8.0500   NaN        S  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>PassengerId</th>\n      <th>Survived</th>\n      <th>Pclass</th>\n      <th>Name</th>\n      <th>Sex</th>\n      <th>Age</th>\n      <th>SibSp</th>\n      <th>Parch</th>\n      <th>Ticket</th>\n      <th>Fare</th>\n      <th>Cabin</th>\n      <th>Embarked</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>0</td>\n      <td>3</td>\n      <td>Braund, Mr. Owen Harris</td>\n      <td>male</td>\n      <td>22.0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>A/5 21171</td>\n      <td>7.2500</td>\n      <td>NaN</td>\n      <td>S</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>1</td>\n      <td>1</td>\n      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n      <td>female</td>\n      <td>38.0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>PC 17599</td>\n      <td>71.2833</td>\n      <td>C85</td>\n      <td>C</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3</td>\n      <td>1</td>\n      <td>3</td>\n      <td>Heikkinen, Miss. Laina</td>\n      <td>female</td>\n      <td>26.0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>STON/O2. 3101282</td>\n      <td>7.9250</td>\n      <td>NaN</td>\n      <td>S</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>4</td>\n      <td>1</td>\n      <td>1</td>\n      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n      <td>female</td>\n      <td>35.0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>113803</td>\n      <td>53.1000</td>\n      <td>C123</td>\n      <td>S</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>5</td>\n      <td>0</td>\n      <td>3</td>\n      <td>Allen, Mr. William Henry</td>\n      <td>male</td>\n      <td>35.0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>373450</td>\n      <td>8.0500</td>\n      <td>NaN</td>\n      <td>S</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 3
    }
   ],
   "source": [
    
    "project_url = 'https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/'\n",
    "data_path = 'master/001/data/'\n",
    "train = pd.read_csv(project_url+data_path+'train.csv')\n",
    "test = pd.read_csv(project_url+data_path+'test.csv')\n",
    "\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "   PassengerId  Pclass                                          Name     Sex  \\\n0          892       3                              Kelly, Mr. James    male   \n1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   \n2          894       2                     Myles, Mr. Thomas Francis    male   \n3          895       3                              Wirz, Mr. Albert    male   \n4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   \n\n    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  \n0  34.5      0      0   330911   7.8292   NaN        Q  \n1  47.0      1      0   363272   7.0000   NaN        S  \n2  62.0      0      0   240276   9.6875   NaN        Q  \n3  27.0      0      0   315154   8.6625   NaN        S  \n4  22.0      1      1  3101298  12.2875   NaN        S  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>PassengerId</th>\n      <th>Pclass</th>\n      <th>Name</th>\n      <th>Sex</th>\n      <th>Age</th>\n      <th>SibSp</th>\n      <th>Parch</th>\n      <th>Ticket</th>\n      <th>Fare</th>\n      <th>Cabin</th>\n      <th>Embarked</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>892</td>\n      <td>3</td>\n      <td>Kelly, Mr. James</td>\n      <td>male</td>\n      <td>34.5</td>\n      <td>0</td>\n      <td>0</td>\n      <td>330911</td>\n      <td>7.8292</td>\n      <td>NaN</td>\n      <td>Q</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>893</td>\n      <td>3</td>\n      <td>Wilkes, Mrs. James (Ellen Needs)</td>\n      <td>female</td>\n      <td>47.0</td>\n      <td>1</td>\n      <td>0</td>\n      <td>363272</td>\n      <td>7.0000</td>\n      <td>NaN</td>\n      <td>S</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>894</td>\n      <td>2</td>\n      <td>Myles, Mr. Thomas Francis</td>\n      <td>male</td>\n      <td>62.0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>240276</td>\n      <td>9.6875</td>\n      <td>NaN</td>\n      <td>Q</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>895</td>\n      <td>3</td>\n      <td>Wirz, Mr. Albert</td>\n      <td>male</td>\n      <td>27.0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>315154</td>\n      <td>8.6625</td>\n      <td>NaN</td>\n      <td>S</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>896</td>\n      <td>3</td>\n      <td>Hirvonen, Mrs. Alexander (Helga E Lindqvist)</td>\n      <td>female</td>\n      <td>22.0</td>\n      <td>1</td>\n      <td>1</td>\n      <td>3101298</td>\n      <td>12.2875</td>\n      <td>NaN</td>\n      <td>S</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 4
    }
   ],
   "source": [
    "test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for null/missing values \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect the statistical measures\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Second phase: Data Analysis and Visualization (this process is where you explore the data, clean it and infer some insights from it)\n",
    "Tasks:\n",
    "- Plot the gender distribution of passengers on board\n",
    "- Plot the rate of men and women who survived and who didnt survive\n",
    "- Plot the survival rate by \"Pclass\" (male and female counted together in each Pclass)\n",
    "- Plot the survival rate by males only in each \"Pclass\"\n",
    "- Plot the survival rate by females only in each \"Pclass\"\n",
    "\n",
    "Think about the plots that you just did, what can you infer from them?\n",
    "Now lets have a look at the columns, there are always unnecessary columns that do not contribute to the prediction.\n",
    "\n",
    "- Remove unnecessary columns from both train and test datasets\n",
    "- Handle categorical text values and turn them into numerical\n",
    "- Plot number of people who survived over age and passenger class\n",
    "\n",
    "- Deal with null values in the Age column\n",
    "- Group the data in the Age column into groups for better prediction.\n",
    "- Get survival rates by age groups \n",
    "- Plot the correlation between features and label\n",
    "\n",
    "\n",
    "What do you infer from the data? What can you conclude from it?who is most likely to survive, based on the data?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Third phase:\n",
    "### Building the model and testing the accuracy (this process is where you start building models and choose a model based on accuracy metrics)\n",
    "- Choose a model suitable for classification\n",
    "- Fit the data\n",
    "- Find out how \"well\" the model performs using some metrics\n",
    "- Use cross validation to get the avg accuracy on the model you chose\n",
    "- **Bonus 1** \n",
    "    - Try other classification algorithms \n",
    "    - Compare the accuracy metrics (including cross-validation) for the classification algorithms used by presenting them in a readable, easy to compare format\n",
    "    - Choose a model with the best cross validation accuracy metric\n",
    "- **Bonus 2**: Get the feature importance of your features using random forests\n",
    "    (if you are not sure what that is or how to do it, google it!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ]
}
